By William Weihnacht and Daniel Song
In recent years, League of Legends has grown not only in popularity as the most played e-sports title but also as a venue for data analytics. League of Legends is arguably the most played video game on PC and the most watched e-sport on the planet, with daily viewership in the hundreds of thousands on Twitch. As a video game, it has intricately complex dynamics: two teams of five players compete in a contest that is part arms race, part battle, each trying to destroy the other's towers and central nexus. Each game is like a mini-war that usually resolves within 30 minutes to an hour, depending on how quickly one team can gain an advantage over the other. League of Legends currently features 140+ distinct champions (avatars), each with unique abilities and properties that yield different advantages and disadvantages, and Riot Games (the creator of League of Legends) consistently releases new champions every few months, resulting in a constantly shifting meta. Each base is protected by a set of defensive towers that shoot at the first enemy they come across. The nexus, or command center, of each base generates AI-controlled minions that serve as foot soldiers; opponents kill them to amass the gold and experience points used to strengthen their champions. The game ends when one of the nexuses is destroyed.
The dynamics of the game can become quite complicated as all these factors stack together, which makes analyzing it challenging. However, as the figurehead of e-sports for the roughly ten years since its release, the game draws heavy scrutiny of matchup statistics and match analysis. In our project, we hope to shed some light on the driving forces of the current meta, as well as model the likelihood of a team winning given its current resources.
In collecting data for our project we were lucky in that there is a plethora of League-related data widely accessible, including Riot's own match data API. However, instead of scraping data from random public matches, we chose to collect data from an existing professional League analysis website, Oracle's Elixir, so we could analyze the game at the highest level of play and in the current meta. The main challenge was therefore reformatting the data into the shape we wanted.
Oracle's Elixir is the No. 1 LoL Esports Statistics + Analytics Website founded by Tim Sevenhuysen.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.formula.api
#read in dataset
data = pd.read_excel('./2019-summer-match-data-OraclesElixir-2019-06-24.xlsx')
To start off, there is much more data than we need to get a good picture of the current meta, so we can go ahead and select the most relevant 45 of the 98 columns.
#select relevant columns
less_data = data[['date','gameid','playerid','player','team','side','gamelength','result','k','d','a','teamkills','teamdeaths','fb','kpm','okpm','fd','fdtime','teamdragkills','oppdragkills','ft','firstmidouter','firsttothreetowers','teamtowerkills','opptowerkills','fbaron','teambaronkills','oppbaronkills','dmgtochampsperminute','wpm','wcpm','totalgold','goldspent','csat10','oppcsat10','csdat10','goldat10','oppgoldat10','gdat10','goldat15','oppgoldat15','gdat15','xpat10','oppxpat10','xpdat10']]
Now that we have a DataFrame full of statistics we can start to break it down into more relevant subsets. We want to look at performance based on individual champions, as well as what the most successful teams are prioritizing doing.
To get clean team stats, we simply select all rows where the player is listed as Team, meaning the row contains the team-level totals for that game. We then group all rows that correspond to the same team and average them, yielding a table of team averages across all games in this tournament. Doing so makes the playerid, teamkills, and teamdeaths variables redundant, so they are dropped.
All our data is grouped by team; as a result, the teams in our dataset serve as the index of the dataframe. This will matter when looking at distributions of win percentages in the machine learning section, where the plots show win percentage on the x-axis and the estimated number of teams on the y-axis.
#team stats for each game
team_stats = less_data.loc[less_data['player'] == 'Team']
#team data averaged across all games
avg_team_data = team_stats.groupby(['team']).mean()
avg_team_data.drop(['playerid','teamkills','teamdeaths'], axis=1, inplace=True)
avg_team_data.head()
Next, we look at individual player data with the goal of finding which champions are used to the greatest effect. Some missing values are recorded as either an empty string or a single space instead of "not a number", so we convert both to NaN and then drop them all at once.
#collect average champion data
champ_data = data[['champion','position','result','earnedgoldshare','dmgshare','wardshare']].copy()
champ_data.replace('', np.nan, inplace=True)
champ_data.replace(' ', np.nan, inplace=True)
champ_data.dropna(inplace=True)
champ_data.head()
First off, I have seen theories that teams wearing a certain color might have a slight advantage in various sports and video games, so I thought it would be interesting to check here. I was impressed to find that blue teams won 55.6% of the time. After some research, I found that this is because the blue team picks the first champion in the draft, giving it a statistical edge in pro play.
#calculate side vs win%
r = 0
b = 0
for index,row in team_stats.iterrows():
    if row['side'] == 'Red' and row['result'] == 1:
        r += 1
    elif row['side'] == 'Blue' and row['result'] == 1:
        b += 1
plt.pie([r,b],labels=['Red','Blue'],colors=['Red','Blue'],autopct='%1.1f%%')
plt.title("Win rate by side")
plt.show()
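As an aside, the counting loop above can be collapsed: because result is coded 0/1, the mean of result per side is exactly the win rate. A minimal sketch on a made-up mini-frame (the numbers below are invented; team_stats has the same two columns):

```python
import pandas as pd

# Hypothetical stand-in for team_stats with only the columns this needs.
demo = pd.DataFrame({
    'side':   ['Blue', 'Red', 'Blue', 'Red', 'Blue', 'Red'],
    'result': [1, 0, 1, 0, 0, 1],
})

# Mean of a 0/1 column per group is exactly the per-side win rate.
win_rate = demo.groupby('side')['result'].mean()
print(win_rate)
```

The same one-liner applied to the real team_stats would reproduce the red/blue split without any manual counters.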
I now look at the variable in the data that is likely the biggest indicator of winning the match: being first to three towers. Since three towers is the minimum you must destroy in a lane before you can attack an inhibitor and, eventually, the nexus, it makes sense that this would be a strong indicator of success. We can see that being first to three towers and going on to win accounts for almost 40% of all cases in the chart.
#calculate first to 3 towers vs win%
f_w = 0
f_l = 0
nf_w = 0
nf_l = 0
for index,row in team_stats.iterrows():
    if row['firsttothreetowers'] == 1 and row['result'] == 1:
        f_w += 1
    elif row['firsttothreetowers'] == 1 and row['result'] == 0:
        f_l += 1
    elif row['firsttothreetowers'] == 0 and row['result'] == 1:
        nf_w += 1
    elif row['firsttothreetowers'] == 0 and row['result'] == 0:
        nf_l += 1
plt.pie([f_w,nf_w,nf_l,f_l],labels=['First and won','Not first and won','Not first and lost','First and lost'],colors=['MediumSeaGreen','Green','Red','Salmon'],autopct='%1.1f%%')
plt.title("Win rate by first to three towers")
plt.show()
Given this, it makes sense that taking the first tower confers a slightly smaller but still significantly increased chance of winning. Interestingly, the first tower gives an 8.7% bump, but being first to the next two towers adds only another 5.7%. This is a testament to the fact that resources can be used to "snowball" an advantage, and their marginal value diminishes as the game goes on.
#calculate first tower vs win%
ft_w = 0
ft_l = 0
nft_w = 0
nft_l = 0
for index,row in team_stats.iterrows():
    if row['ft'] == 1 and row['result'] == 1:
        ft_w += 1
    elif row['ft'] == 1 and row['result'] == 0:
        ft_l += 1
    elif row['ft'] == 0 and row['result'] == 1:
        nft_w += 1
    elif row['ft'] == 0 and row['result'] == 0:
        nft_l += 1
plt.pie([ft_w,nft_w,nft_l,ft_l],labels=['First and won','Not first and won','Not first and lost','First and lost'],colors=['MediumSeaGreen','Green','Red','Salmon'],autopct='%1.1f%%')
plt.title("Win rate by first tower")
plt.show()
Similarly, first blood gives only a 3.6% increased chance of winning, which is reasonable because it nets just 400 gold, compared to 1000 for a tower or 300 for a normal kill.
#calculate first blood vs win%
fb_w = 0
fb_l = 0
nfb_w = 0
nfb_l = 0
for index,row in team_stats.iterrows():
    if row['fb'] == 1 and row['result'] == 1:
        fb_w += 1
    elif row['fb'] == 1 and row['result'] == 0:
        fb_l += 1
    elif row['fb'] == 0 and row['result'] == 1:
        nfb_w += 1
    elif row['fb'] == 0 and row['result'] == 0:
        nfb_l += 1
plt.pie([fb_w,nfb_w,nfb_l,fb_l],labels=['First and won','Not first and won','Not first and lost','First and lost'],colors=['MediumSeaGreen','Green','Red','Salmon'],autopct='%1.1f%%')
plt.title("Win rate by first blood")
plt.show()
Being first to kill a dragon is more significant than first blood but less significant than first tower. Dragons are important for building advantages because their buffs last the entire game, but they don't give the same gold and lane-control advantage as a tower.
#calculate first dragon vs win%
fd_w = 0
fd_l = 0
nfd_w = 0
nfd_l = 0
for index,row in team_stats.iterrows():
    if row['fd'] == 1 and row['result'] == 1:
        fd_w += 1
    elif row['fd'] == 1 and row['result'] == 0:
        fd_l += 1
    elif row['fd'] == 0 and row['result'] == 1:
        nfd_w += 1
    elif row['fd'] == 0 and row['result'] == 0:
        nfd_l += 1
plt.pie([fd_w,nfd_w,nfd_l,fd_l],labels=['First and won','Not first and won','Not first and lost','First and lost'],colors=['MediumSeaGreen','Green','Red','Salmon'],autopct='%1.1f%%')
plt.title("Win rate by first dragon")
plt.show()
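The four objective breakdowns above all follow the same counting pattern, so they can be generalized into one helper. A sketch (objective_breakdown is a name I'm introducing, not from the notebook) using pd.crosstab, demonstrated on invented data:

```python
import pandas as pd

def objective_breakdown(df, col):
    """Cross-tabulate a 0/1 objective column against the 0/1 result column,
    returning each of the four outcomes as a share of all rows."""
    return pd.crosstab(df[col], df['result'], normalize='all')

# Hypothetical stand-in for team_stats with an 'ft' (first tower) flag.
demo = pd.DataFrame({'ft': [1, 1, 0, 0], 'result': [1, 0, 1, 0]})
tab = objective_breakdown(demo, 'ft')
print(tab)
```

Calling `objective_breakdown(team_stats, 'fb')`, `'fd'`, `'ft'`, or `'firsttothreetowers'` would reproduce each pie chart's four slices in one line.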
Next, we will look at individual role/champion data. First, I count the number of times each champion was played at each position and remove champions used 10 or fewer times at a position. Since we are dealing with averages, outliers such as a champion who was picked once at a position and won would skew the data.
The Support and ADC (usually a ranged champion) operate in the bottom lane. The remaining roles are named after the lane or area they operate in: Top operates in the top lane, Mid in the middle lane, and Jungle mainly in the jungle (the forested areas between the lanes), roaming into any lane to support teammates.
#count each time a champion was picked for each position
pick_counts = champ_data.groupby(['position','champion']).size()
#remove data for champions used less than 10 times at a given position
champ_data2 = []
for index,row in champ_data.iterrows():
    if pick_counts[row['position']][row['champion']] > 10:
        champ_data2.append(row)
champ_data2 = pd.DataFrame(champ_data2)
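The same filtering can be done without iterrows via groupby().filter(). A sketch on a made-up mini-frame, with the threshold lowered to 2 so the demo stays tiny (the notebook uses 10):

```python
import pandas as pd

# Hypothetical mini-frame standing in for champ_data.
demo = pd.DataFrame({
    'position': ['Top', 'Top', 'Top', 'Middle'],
    'champion': ['Sylas', 'Sylas', 'Sylas', 'Yasuo'],
    'result':   [1, 0, 1, 1],
})

# Keep only (position, champion) pairs picked more than 2 times.
filtered = demo.groupby(['position', 'champion']).filter(lambda g: len(g) > 2)
print(filtered)
```

On the real data, `champ_data.groupby(['position', 'champion']).filter(lambda g: len(g) > 10)` would replace both the pick_counts table and the loop.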
Here I group the data by position and champion and average it, rename result to Win Percent for clarity, split the DataFrame by position played, and write a little function to re-index each position's data for clean graphing.
#Average the data
avg_position_data = champ_data2.groupby(['position','champion']).mean()
avg_position_data.rename({'result':'Win Percent'},inplace=True,axis=1)
#Split the data by position
top,mid,adc,jg,sup = [],[],[],[],[]
for index,row in avg_position_data.iterrows():
    if index[0] == "ADC":
        adc.append(row)
    if index[0] == "Jungle":
        jg.append(row)
    if index[0] == "Middle":
        mid.append(row)
    if index[0] == "Support":
        sup.append(row)
    if index[0] == "Top":
        top.append(row)
#function to clean the data for graphing
def reindex(pos):
    df = pd.DataFrame(pos)
    i = []
    for index,row in df.iterrows():
        i.append(index[1])
    df.index = i
    return df
#applying the function to the dataframes created
top = reindex(top)
mid = reindex(mid)
adc = reindex(adc)
jg = reindex(jg)
sup = reindex(sup)
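For what it's worth, pandas can slice a MultiIndex level directly with .xs, which selects one position and drops that level, leaving champions as the index: the same shape the reindex() helper produces. A sketch on a hypothetical frame:

```python
import pandas as pd

# Hypothetical stand-in for avg_position_data (values are made up).
avg = pd.DataFrame(
    {'Win Percent': [0.60, 0.50, 0.55]},
    index=pd.MultiIndex.from_tuples(
        [('Top', 'Sylas'), ('Top', 'Akali'), ('Middle', 'Yasuo')],
        names=['position', 'champion']))

# .xs selects one value of the position level and drops it,
# leaving a plain champion index ready for .plot.bar().
top_slice = avg.xs('Top', level='position')
print(top_slice)
```

Against the real table, `avg_position_data.xs('Top', level='position')` would replace the loop-and-reindex pipeline for each role.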
In the top position the big winner jumps out as Sylas, with a win percentage over 60% across 24 games. Sylas was a strong mid-laner in patch 9.1, and it is a testament to his utility that he is used in the top lane too. Interestingly, the damage and gold shares of the most successful champions in this role are lower than those of more middle-of-the-pack picks. This makes sense in a way: top is an isolated position, and one can succeed in one's own lane even while the team loses control elsewhere.
top.sort_values('Win Percent').plot.bar()
plt.ylabel("Percentage")
plt.show()
In the middle role, we see that Sylas is dominant here as well, on par with Yasuo, another strong pick. There is a lot of variety in this position and some overlap with top, as the two have similar roles. Akali, for example, is used in both but to more success in middle than top, likely because she is more of a ganker (ganking refers to ambushing an enemy in a different lane). Once again, some of the lower win-rate champions have good damage and gold shares, which can potentially be attributed to the bottom lane failing to overpower the enemy and being pushed in, leading to proportionally higher success from the more individual middle and top lanes.
mid.sort_values('Win Percent').plot.bar()
plt.ylabel("Percentage")
plt.show()
The ADC role appears to have more of a set meta, with only 8 commonly used champions compared to the 15 used at middle. The ADC, which stands for attack damage carry, is tasked with outputting the majority of team damage toward the end of the game. While Kalista has by far the highest win rate, this can be a little deceiving, as she was used only 11 times compared to Sivir's 92 and Xayah's 90. With a respectable win rate of 55% over 28 games, Varus seems to be one of the stronger damage dealers at this position, while the supposedly meta Xayah and Sivir saw mediocre win rates.
adc.sort_values('Win Percent').plot.bar()
plt.ylabel("Percentage")
plt.show()
Junglers are tasked with killing their own jungle camps, stealing the enemy's, and ganking. Once again, Karthus has a deceptively high win rate given that he played only 11 games; nonetheless he is an effective damage dealer, though he lacks the utility of other champions. Trundle is also effective: although he lacks damage, he provides utility by hindering enemies in gank situations to secure extra kills, which accounts for his low damage share yet high success.
jg.sort_values('Win Percent').plot.bar()
plt.ylabel("Percentage")
plt.show()
The support position is tasked with vision control and enabling the ADC and other roles through buffs. Once again the highest win percentage, Pyke's, comes from a smaller sample size. However, unlike in some other roles, two of the most-played champions are also among the most successful: Rakan is strong as a standard healer, while Tahm Kench has an ultimate that provides mobility to teammates plus a stun.
sup.sort_values('Win Percent').plot.bar()
plt.ylabel("Percentage")
plt.show()
Overall, there are some champions, such as Sylas and Varus, that are used frequently and yield a high win rate. However, it seems the most successful picks are often situational: while not outright the strongest at their position, they can be slotted into a team that synergizes with them, or serve as a counter-pick that exploits the other team's weaknesses. This is a testament to the fluid nature of the game; while some champions are strong at certain points in time, there is generally room for counterplay, meaning the game is well designed and balanced, and ingenuity will prevail over spamming "meta picks".
from sklearn.model_selection import train_test_split, KFold, cross_val_score, cross_val_predict
from sklearn import model_selection, linear_model, preprocessing, metrics
from sklearn.preprocessing import LabelEncoder
from scipy.stats import f
import statsmodels.formula.api as smf
from mpl_toolkits import mplot3d
import warnings
warnings.filterwarnings('ignore')
Reference for gameplay factors: http://oracleselixir.com/match-data/match-data-dictionary/
by_team_data = data.loc[data['player'] == 'Team'].copy()
by_team_data.fillna(0, inplace=True)
team_data = by_team_data.groupby(['team']).mean()
result = team_data[['result']]
team_data.head()
# champ_kills is a sub-dataframe of team_data based on the Champion Kills factors.
champ_kills = team_data[['kpm','teamkills','a']]
# Set X as champion kills features
X = champ_kills
# set our win_percentage as y
y = result
# Split data into Train and Test
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y)
# run the linear regression on the training set using sklearn library
lm = linear_model.LinearRegression()
model = lm.fit(X_train, y_train)
# run the linear regression on the training set using statsmodel library
champ_kills_LRM = smf.OLS(y_train, X_train).fit()
# Predict win percentage using our regression model on the Test data
preds_champKillsLRM = champ_kills_LRM.predict(X_test)
# Predict win percentage using K-fold cross validation on the Test data
KFCV_champKillsLRM = cross_val_predict(model, X, y, cv=5)
# Plot the predicted values (linear regression, k-fold cross validation) against the actual values
f, ax = plt.subplots(figsize=(13,10))
plt.title('Data Distribution for Actual and Predicted based on Champion Kills')
# plot actual values for win_percentage
sns.distplot(y_test, hist=False, label="Actual", ax=ax)
# plot linear regression values for win_percentage
sns.distplot(preds_champKillsLRM, hist=False, label="Linear Regression Predictions", ax=ax)
# plot linear regression values for k-fold cross validation
sns.distplot(KFCV_champKillsLRM, hist=False, label="K-fold Cross Validation Predictions", ax=ax)
ax.set(xlabel="Win Percentage", ylabel="Avg Number of Teams")
plt.show()
champ_kills_LRM.summary()
# not_dying is a sub-dataframe of team_data based on the team deaths factor.
not_dying = team_data[['teamdeaths']]
# Set X as not_dying dataframe
X = not_dying
# y is already set as our win_percentage
# Split data into Train and Test
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y)
# run the linear regression on the training set using sklearn library
lm = linear_model.LinearRegression()
model = lm.fit(X_train, y_train)
# run the linear regression on the training set using statsmodel library
notDying_LRM = smf.OLS(y_train, X_train).fit()
# Predict win percentage using our regression model on the Test data
preds_NotDying = notDying_LRM.predict(X_test)
# Predict win percentage using K-fold cross validation on the Test data
KFCV_NotDying = cross_val_predict(model, X, y, cv=5)
# Plot the predicted values (linear regression, k-fold cross validation) against the actual values
f, ax = plt.subplots(figsize=(13,10))
plt.title('Data Distribution for Actual and Predicted based on Not Dying')
# plot actual values for win_percentage
sns.distplot(y_test, hist=False, label="Actual", ax=ax)
# plot linear regression values for win_percentage
sns.distplot(preds_NotDying, hist=False, label="Linear Regression Predictions", ax=ax)
# plot linear regression values for k-fold cross validation
sns.distplot(KFCV_NotDying, hist=False, label="K-fold Cross Validation Predictions", ax=ax)
ax.set(xlabel="Win Percentage", ylabel="Avg Number of Teams")
plt.show()
notDying_LRM.summary()
large monster kills
teamdragkills = Total dragons killed by team.
herald = Rift herald taken (1 yes, 0 opponent took it, blank herald not killed).
fbaron = First baron of game killed (1 yes, 0 no).
teambaronkills = Total barons killed by team.
heraldtime = Herald kill time, in minutes.
fbarontime = First baron time, in minutes.
With these multiple factors, we run a linear regression to fit the data.
# largeMonsterKills is a sub-dataframe of team_data based on the "Large Monster Kills" factors.
largeMonsterKills = team_data[['teamdragkills','herald','heraldtime','fbaron','fbarontime','teambaronkills']]
# Set X as largeMonsterKills dataframe
X = largeMonsterKills
# y is already set as our win_percentage
# Split data into Train and Test
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y)
# run the linear regression on the training set using sklearn library
lm = linear_model.LinearRegression()
model = lm.fit(X_train, y_train)
# run the linear regression on the training set using statsmodel library
largeMonsterKillsLRM = smf.OLS(y_train, X_train).fit()
# Predict win percentage using our regression model on the Test data
preds_LargeMonsterKills = largeMonsterKillsLRM.predict(X_test)
# Predict win percentage using K-fold cross validation on the Test data
KFCV_LargeMonsterKills = cross_val_predict(model, X, y, cv=5)
# Plot the predicted values (linear regression, k-fold cross validation) against the actual values
f, ax = plt.subplots(figsize=(13,10))
plt.title('Data Distribution for Actual and Predicted based on Monster Kills')
# plot actual values for win_percentage
sns.distplot(y_test, hist=False, label="Actual", ax=ax)
# plot linear regression values for win_percentage
sns.distplot(preds_LargeMonsterKills, hist=False, label="Linear Regression Predictions", ax=ax)
# plot linear regression values for k-fold cross validation
sns.distplot(KFCV_LargeMonsterKills, hist=False, label="K-fold Cross Validation Predictions", ax=ax)
ax.set(xlabel="Win Percentage", ylabel="Avg Number of Teams")
plt.show()
largeMonsterKillsLRM.summary()
tower kills
ft = First tower of game killed (1 yes, 0 no).
firstmidouter = First team to kill mid lane outer tower (1 yes, 0 no).
firsttothreetowers = First team to kill three towers (1 yes, 0 no).
teamtowerkills = Total towers killed by team.
fttime = First tower kill time, in minutes.
With these factors, we run a linear regression to fit the data.
# tower_kills is a sub-dataframe of team_data based on the "tower kills" factors.
tower_kills = team_data[['ft','firstmidouter','firsttothreetowers','teamtowerkills','fttime']]
# Set X as tower_kills dataframe; y is already set as our win_percentage
X = tower_kills
# Split data into Train and Test
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y)
# run the linear regression on the training set using sklearn library
lm = linear_model.LinearRegression()
model = lm.fit(X_train, y_train)
# run the linear regression on the training set using statsmodel library
towerKillsLRM = smf.OLS(y_train, X_train).fit()
# Predict win percentage using our regression model on the Test data
preds_TowerKills = towerKillsLRM.predict(X_test)
# Predict win percentage using K-fold cross validation on the Test data
KFCV_TowerKills = cross_val_predict(model, X, y, cv=5)
# Plot the predicted values (linear regression, k-fold cross validation) against the actual values
f, ax = plt.subplots(figsize=(13,10))
plt.title('Data Distribution for Actual and Predicted based on Tower Kills')
# plot actual values for win_percentage
sns.distplot(y_test, hist=False, label="Actual", ax=ax)
# plot linear regression values for win_percentage
sns.distplot(preds_TowerKills, hist=False, label="Linear Regression Predictions", ax=ax)
# plot linear regression values for k-fold cross validation
sns.distplot(KFCV_TowerKills, hist=False, label="K-fold Cross Validation Predictions", ax=ax)
ax.set(xlabel="Win Percentage", ylabel="Avg Number of Teams")
plt.show()
towerKillsLRM.summary()
creep score
minionkills = Lane minions killed.
monsterkills = Neutral monsters killed.
cspm = Creep score per minute. All creep score variables include minions and monsters.
csat10 = Creep score at 10:00.
csdat10 = Creep score difference at 10:00.
With these multiple factors, we run a linear regression to fit the data.
# minion_kills is a sub-dataframe of team_data based on the "creep score" factors.
minion_kills = team_data[['minionkills','monsterkills','cspm','csat10','csdat10']]
# Set X as minion_kills dataframe
# y is already set as our win_percentage
X = minion_kills
# Split data into Train and Test
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y)
# run the linear regression on the training set using sklearn library
lm = linear_model.LinearRegression()
model = lm.fit(X_train, y_train)
# run the linear regression on the training set using statsmodel library
creepScoreLRM = smf.OLS(y_train, X_train).fit()
# Predict win percentage using our regression model on the Test data
preds_CreepScore = creepScoreLRM.predict(X_test)
# Predict win percentage using K-fold cross validation on the Test data
KFCV_CreepScore = cross_val_predict(model, X, y, cv=5)
# Plot the predicted values (linear regression, k-fold cross validation) against the actual values
f, ax = plt.subplots(figsize=(13,10))
plt.title('Data Distribution for Actual and Predicted based on Creep Score')
# plot actual values for win_percentage
sns.distplot(y_test, hist=False, label="Actual", ax=ax)
# plot linear regression values for win_percentage
sns.distplot(preds_CreepScore, hist=False, label="Linear Regression Predictions", ax=ax)
# plot linear regression values for k-fold cross validation
sns.distplot(KFCV_CreepScore, hist=False, label="K-fold Cross Validation Predictions", ax=ax)
ax.set(xlabel="Win Percentage", ylabel="Avg Number of Teams")
plt.show()
creepScoreLRM.summary()
vision
wards = Total wards placed (of all types).
wpm = Wards placed per minute (of all types).
wardkills = Total wards cleared/killed (of all types).
wcpm = Total wards cleared/killed per minute (of all types).
visionwards = Vision/control wards placed.
visionwardbuys = Vision/control wards purchased.
With these multiple factors, we run a linear regression to fit the data.
# vision is a sub-dataframe of team_data based on the "vision" factors.
vision = team_data[['wards','wpm','wardkills','wcpm','visionwards','visionwardbuys']]
# Set X as vision dataframe; y is already set as our win_percentage
X = vision
# Split data into Train and Test
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y)
# run the linear regression on the training set using sklearn library
lm = linear_model.LinearRegression()
model = lm.fit(X_train, y_train)
# Fit the Linear Regression on Train split
visionLRM = smf.OLS(y_train, X_train).fit()
# Predict using Test split
preds_Vision = visionLRM.predict(X_test)
# Predict win percentage using K-fold cross validation on the Test data
KFCV_Vision = cross_val_predict(model, X, y, cv=5)
# Plot how the predicted win_ratio compares to actual win ratio
f, ax = plt.subplots(figsize=(13,10))
plt.title('Data Distribution for Actual and Predicted based on Vision')
sns.distplot(y_test, hist=False, label="Actual", ax=ax)
sns.distplot(preds_Vision, hist=False, label="Linear Regression Predictions", ax=ax)
sns.distplot(KFCV_Vision, hist=False, label="K-fold Cross Validation Predictions", ax=ax)
ax.set(xlabel="Win Percentage", ylabel="Avg Number of Teams")
plt.show()
visionLRM.summary()
economics
totalgold = Total gold earned from all sources.
earnedgpm = Earned gold per minute.
goldspent = Total gold spent.
gspd = Gold spent percentage difference.
goldat10 = Total gold earned at 10:00.
gdat10 = Gold difference at 10:00.
goldat15 = Total gold earned at 15:00.
gdat15 = Gold difference at 15:00.
With these multiple factors, we run a linear regression to fit the data.
# economics is a sub-dataframe of team_data based on the gold/economy factors.
economics = team_data[['totalgold','earnedgpm','goldspent','gspd','goldat10','gdat10','goldat15','gdat15']]
# Set X as economics dataframe; y is already set as our win_percentage
X = economics
# Split data into Train and Test
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y)
# run the linear regression on the training set using sklearn library
lm = linear_model.LinearRegression()
model = lm.fit(X_train, y_train)
# run the linear regression on the training set using statsmodel library
econLRM = smf.OLS(y_train, X_train).fit()
# Predict win percentage using our regression model on the Test data
preds_Econ = econLRM.predict(X_test)
# Predict win percentage using K-fold cross validation on the Test data
KFCV_Econ = cross_val_predict(model, X, y, cv=5)
# Plot the predicted values (linear regression, k-fold cross validation) against the actual values
f, ax = plt.subplots(figsize=(13,10))
plt.title('Data Distribution for Actual and Predicted based on Economics')
# plot actual values for win_percentage
sns.distplot(y_test, hist=False, label="Actual", ax=ax)
# plot linear regression values for win_percentage
sns.distplot(preds_Econ, hist=False, label="Linear Regression Predictions", ax=ax)
# plot linear regression values for k-fold cross validation
sns.distplot(KFCV_Econ, hist=False, label="K-fold Cross Validation Predictions", ax=ax)
ax.set(xlabel="Win Percentage", ylabel="Avg Number of Teams")
plt.show()
econLRM.summary()
We will now create a more comprehensive linear regression model based on the most statistically relevant factors we encountered in the seven aforementioned linear regressions.
Most relevant groups of factors:
- champion kills
- large monster kills
- tower kills
- economics
Most relevant factors:
- kpm
- teamdeaths
- teamdragkills
- teamtowerkills
- fttime
- minionkills
- monsterkills
- visionwards
- visionwardbuys
- totalgold
- earnedgpm
- goldspent
# relevant is a sub-dataframe of team_data based on the most relevant factors.
relevant = team_data[['kpm', 'teamdeaths', 'teamdragkills', 'teamtowerkills', 'fttime', 'minionkills', 'monsterkills', 'visionwards', 'visionwardbuys', 'totalgold', 'earnedgpm', 'goldspent']]
# Set X as relevant dataframe; y is already set as our win_percentage
X = relevant
# Split data into Train and Test
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y)
# run the linear regression on the training set using sklearn library
lm = linear_model.LinearRegression()
model = lm.fit(X_train, y_train)
# run the linear regression on the training set using statsmodel library
relevantLRM = smf.OLS(y_train, X_train).fit()
# Predict win percentage using our regression model on the Test data
preds_Relevant = relevantLRM.predict(X_test)
# Predict win percentage using K-fold cross validation on the Test data
KFCV_Relevant = cross_val_predict(model, X, y, cv=5)
# Plot the predicted values (linear regression, k-fold cross validation) against the actual values
f, ax = plt.subplots(figsize=(13,10))
plt.title('Data Distribution for Actual and Predicted based on Most Relevant Factors')
# plot actual values for win_percentage
sns.distplot(y_test, hist=False, label="Actual", ax=ax)
# plot linear regression values for win_percentage
sns.distplot(preds_Relevant, hist=False, label="Linear Regression Predictions", ax=ax)
# plot linear regression values for k-fold cross validation
sns.distplot(KFCV_Relevant, hist=False, label="K-fold Cross Validation Predictions", ax=ax)
ax.set(xlabel="Win Percentage", ylabel="Avg Number of Teams")
plt.show()
relevantLRM.summary()
A follow-up analysis would be to account for time relevant factors such as accomplishing certain objectives within a certain time frame (10, 15 minutes). Another route of follow-up analysis would be to include factors regarding champion selection and/or item selection. Since each champion and item is so unique in its own right and also because the data didn't include these factors, we reserved this analysis for possible follow-up.
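As a sketch of that champion-selection follow-up: champion picks are categorical, so they would need encoding before entering a regression. One simple route is one-hot encoding with pd.get_dummies (the picks frame below is entirely hypothetical):

```python
import pandas as pd

# Hypothetical picks frame: one row per game, one column per role.
picks = pd.DataFrame({
    'top': ['Sylas', 'Akali'],
    'mid': ['Yasuo', 'Sylas'],
})

# One-hot encode every pick so each champion/role pair becomes a 0/1
# column that can sit alongside the numeric regression features.
features = pd.get_dummies(picks, prefix=['top', 'mid'])
print(features.columns.tolist())
```

With 140+ champions across five roles this produces a wide, sparse design matrix, which is part of why we reserved it for follow-up work.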
# Scatter the final model's predicted win percentages against the true values
predictions = lm.predict(X_test)
plt.scatter(y_test, predictions)
plt.xlabel("True Values of win percentage")
plt.ylabel("Predictions of win percentage")
plt.show()
print("Score:", model.score(X_test, y_test))
# Perform 5-fold cross validation
scores = cross_val_score(model, X, y, cv=5)
print("Cross-validated scores:", scores)
KFCV_predictions = cross_val_predict(model, X, y, cv=5)
plt.scatter(y, KFCV_predictions)
accuracy = metrics.r2_score(y, KFCV_predictions)
print("Cross-Predicted Accuracy:", accuracy)
plt.xlabel("True Values of win percentage")
plt.ylabel("Predictions of win percentage")
plt.show()
# read in validation data from the 2018 World Championship
# (our training and test data came from the summer 2019 dataset)
dataSummer18 = pd.read_excel('./2018-worlds-match-data-OraclesElixir-2018-11-03.xlsx')
# tidy, preprocess the data
validation_data = dataSummer18.loc[dataSummer18['player'] == 'Team'].copy()
validation_data.fillna(0, inplace = True)
team_validation_data = validation_data.groupby(['team']).mean()
win_percentage = team_validation_data[['result']]
# Predict win percentage using all of our regression models on the Validation data
X_ChampKills = team_validation_data[['kpm','teamkills','a']]
preds_champKillsLRM = champ_kills_LRM.predict(X_ChampKills)
X_NotDying = team_validation_data[['teamdeaths']]
preds_NotDying = notDying_LRM.predict(X_NotDying)
X_LargeMonsterKills = team_validation_data[['teamdragkills','herald','heraldtime','fbaron','fbarontime','teambaronkills']]
preds_LargeMonsterKills = largeMonsterKillsLRM.predict(X_LargeMonsterKills)
X_TowerKills = team_validation_data[['ft','firstmidouter','firsttothreetowers','teamtowerkills','fttime']]
preds_TowerKills = towerKillsLRM.predict(X_TowerKills)
X_CreepScore = team_validation_data[['minionkills','monsterkills','cspm','csat10','csdat10']]
preds_CreepScore = creepScoreLRM.predict(X_CreepScore)
X_Vision = team_validation_data[['wards','wpm','wardkills','wcpm','visionwards','visionwardbuys']]
preds_Vision = visionLRM.predict(X_Vision)
X_Econ = team_validation_data[['totalgold','earnedgpm','goldspent','gspd','goldat10','gdat10','goldat15','gdat15']]
preds_Econ = econLRM.predict(X_Econ)
X_Relevant = team_validation_data[['kpm', 'teamdeaths', 'teamdragkills', 'teamtowerkills', 'fttime', 'minionkills', 'monsterkills', 'visionwards', 'visionwardbuys', 'totalgold', 'earnedgpm', 'goldspent']]
preds_Relevant = relevantLRM.predict(X_Relevant)
# Plot the predicted values for win_percentage of all linear regression models
# against the actual values for win_percentage
f, ax = plt.subplots(figsize=(16,11))
plt.title('Data Distribution for Actual and Predicted')
# plot actual values for win_percentage
sns.distplot(win_percentage, hist=False, label="Actual", ax=ax)
# plot linear regression values based on Champion Kills
sns.distplot(preds_champKillsLRM, hist=False, label="Linear Regression Predictions based on Champion Kills", ax=ax)
# plot linear regression values based on Not Dying
sns.distplot(preds_NotDying, hist=False, label="Linear Regression Predictions based on Not Dying", ax=ax)
# plot linear regression values based on LargeMonsterKills
sns.distplot(preds_LargeMonsterKills, hist=False, label="Linear Regression Predictions based on LargeMonsterKills", ax=ax)
# plot linear regression values based on TowerKills
sns.distplot(preds_TowerKills, hist=False, label="Linear Regression Predictions based on TowerKills", ax=ax)
# plot linear regression values based on CreepScore
sns.distplot(preds_CreepScore, hist=False, label="Linear Regression Predictions based on CreepScore", ax=ax)
# plot linear regression values based on Vision
sns.distplot(preds_Vision, hist=False, label="Linear Regression Predictions based on Vision", ax=ax)
# plot linear regression values based on Economics
sns.distplot(preds_Econ, hist=False, label="Linear Regression Predictions based on Economics", ax=ax)
# plot linear regression values based on Most Relevant factors
sns.distplot(preds_Relevant, hist=False, label="Linear Regression Predictions based on Most Relevant Factors", ax=ax)
ax.set(xlabel="Win Percentage", ylabel="Avg Number of Teams")
plt.show()